Reward Uncertainty


Mitigating Reward Overoptimization via Lightweight Uncertainty Estimation

Neural Information Processing Systems

Reinforcement Learning from Human Feedback (RLHF) has been pivotal in aligning Large Language Models with human values, but it often suffers from overoptimization due to its reliance on a proxy reward model.


Appendix A Pseudocode of DRE-MARL

Neural Information Processing Systems

The pseudocode for DRE-MARL training is shown in Algorithm 20, which takes the following steps. The reward received in this environment is collaborative. It is a scenario with two agents and three landmarks. The difference between Navigation and Reference is that, in Reference, the target landmark of each agent is known only to its partner. We use the abbreviation REF to denote this environment.



Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

Zhang, Xiaoying, Ton, Jean-Francois, Shen, Wei, Wang, Hongning, Liu, Yang

arXiv.org Artificial Intelligence

We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the overoptimization issue, consequently resulting in enhanced performance as evaluated through human-assisted evaluation.
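The "lightweight" uncertainty the abstract describes can be illustrated as a confidence width computed in the reward model's last-layer feature space. The sketch below uses synthetic data and invented names, and the paper's exact estimator may differ; it scores an embedding by sqrt(phi^T Sigma^-1 phi) against the regularized covariance of embeddings seen during reward-model training, so that out-of-distribution embeddings receive wider reward confidence intervals without any ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical last-layer embeddings of the reward model on its
# training data (n samples, d-dimensional features).
n, d, lam = 500, 16, 1.0
Phi = rng.normal(size=(n, d))

# Regularized empirical second-moment matrix of the embeddings.
Sigma = Phi.T @ Phi + lam * np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)

def reward_uncertainty(phi):
    """Width of a confidence interval around the predicted reward,
    proportional to sqrt(phi^T Sigma^{-1} phi)."""
    return float(np.sqrt(phi @ Sigma_inv @ phi))

# An embedding near the training distribution vs. one far from it:
u_in = reward_uncertainty(rng.normal(size=d))
u_out = reward_uncertainty(10.0 * rng.normal(size=d))
print(u_in, u_out)  # the distant embedding gets a much wider interval
```

A distributionally robust objective like AdvPO's would then optimize the policy against a pessimistic reward within this interval rather than the point estimate.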


Roping in Uncertainty: Robustness and Regularization in Markov Games

McMahan, Jeremy, Artiglio, Giovanni, Xie, Qiaomin

arXiv.org Artificial Intelligence

We study robust Markov games (RMG) with $s$-rectangular uncertainty. We show a general equivalence between computing a robust Nash equilibrium (RNE) of a $s$-rectangular RMG and computing a Nash equilibrium (NE) of an appropriately constructed regularized MG. The equivalence result yields a planning algorithm for solving $s$-rectangular RMGs, as well as provable robustness guarantees for policies computed using regularized methods. However, we show that even for just reward-uncertain two-player zero-sum matrix games, computing an RNE is PPAD-hard. Consequently, we derive a special uncertainty structure called efficient player-decomposability and show that RNE for two-player zero-sum RMG in this class can be provably solved in polynomial time. This class includes commonly used uncertainty sets such as $L_1$ and $L_\infty$ ball uncertainty sets.
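The simplest instance of a reward-uncertain two-player zero-sum game makes the equivalence concrete: with an entrywise $L_\infty$ ball of radius $\alpha$, the adversary's worst case subtracts $\alpha$ from every payoff, so the robust value is the nominal value minus $\alpha$ and the equilibrium strategies are unchanged. The sketch below is an illustration of this special case only (the paper's $s$-rectangular Markov-game setting is far more general), solving each matrix game by linear programming:

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_value(A):
    """Value of the zero-sum matrix game max_x min_y x^T A y via an LP."""
    m, n = A.shape
    # Variables: [x_1..x_m, v]; maximize v  <=>  minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Constraints: v - (A^T x)_j <= 0 for every column j of A.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # x must be a probability distribution.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1]

A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies, value 0
alpha = 0.3
v_nominal = zero_sum_value(A)
v_robust = zero_sum_value(A - alpha)  # adversary lowers every entry by alpha
print(v_nominal, v_robust)  # robust value = nominal value - alpha
```

The PPAD-hardness result in the paper shows that beyond such degenerate uncertainty structures, computing an RNE is not expected to remain this easy.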


Tractable Objectives for Robust Policy Optimization

Neural Information Processing Systems

Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP.
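The trade-off between expected utility and tail risk that motivates percentile measures can be illustrated with a small Monte-Carlo sketch. The setup below is invented for illustration (a scalar parameter with posterior samples and two fixed "policies"); it shows a policy that wins on expected utility but loses badly on a 10th-percentile objective:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior samples of an unknown system parameter theta.
thetas = rng.normal(loc=1.0, scale=1.0, size=10_000)

# Two candidate policies with theta-dependent utilities:
u_risky = 2.0 * thetas              # pays off linearly in theta
u_safe = np.full_like(thetas, 0.8)  # flat payoff, independent of theta

# Expected-utility objective vs. a percentile measure (10th percentile).
mean_risky, mean_safe = u_risky.mean(), u_safe.mean()
p10_risky, p10_safe = np.percentile(u_risky, 10), np.percentile(u_safe, 10)

print(mean_risky, p10_risky)  # high mean, poor lower tail
print(mean_safe, p10_safe)    # lower mean, far better 10th percentile
```

The paper's contribution is to identify which objectives in this percentile family admit efficient approximation, rather than evaluating fixed policies by sampling as above.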


Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization

Gadot, Uri, Derman, Esther, Kumar, Navdeep, Elfatihi, Maxence Mohamed, Levy, Kfir, Mannor, Shie

arXiv.org Artificial Intelligence

In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an $\alpha$-radius from a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDPs and applying policy visitation frequency regularization. We introduce a policy-gradient method, and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.
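The connection between coupled reward uncertainty and visitation-frequency regularization has a closed form in the Euclidean case: since the return is the inner product of the discounted visitation frequencies d^pi with the reward, the worst case over an L2 ball of radius alpha is the nominal return minus alpha * ||d^pi||_2. The sketch below checks this on a tiny two-state chain with a fixed policy; the MDP, norm choice, and state-only rewards are illustrative assumptions, not the paper's general setting:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.3],    # state-to-state transitions under the policy
                 [0.4, 0.6]])
mu = np.array([1.0, 0.0])       # initial state distribution
r0 = np.array([1.0, 0.0])       # nominal state reward

# Discounted visitation frequencies: d = (I - gamma * P^T)^{-1} mu.
d = np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu)

alpha = 0.5
J_nominal = d @ r0
# Closed-form worst case over {r : ||r - r0||_2 <= alpha}:
J_robust = J_nominal - alpha * np.linalg.norm(d)

# Cross-check against the explicit minimizing reward r0 - alpha * d/||d||:
r_worst = r0 - alpha * d / np.linalg.norm(d)
print(J_robust, d @ r_worst)  # the two agree
```

Maximizing J_robust over policies is exactly a return objective penalized by the norm of the visitation frequencies, which is the regularization viewpoint the paper develops with a convergent policy-gradient method.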